BIOL5081: Biostatistics Intro

The first six weeks of the course cover data science and fundamental exploratory data analysis in R. It is a brief introduction to contemporary open-science, best-practice data and biostatistical working tools. The primary goal is exploration of tools and approaches associated with effective, efficient, reproducible biostatistical analyses. Inspirations include Software Carpentry, rforcats, and many more free, open, online resources I can direct you to as needed. The description provided by the university is a useful starting point in deciding whether the Fall 2016 offering meets your specific needs.

use.ful

bio.stats

adventure.time

Course outline

Lesson 1. Data science (DS).

want data. have data. will collect data. assumption: in this course, you are here to work with data. data literacy IS data science.

WhyR introductory lecture

The importance of data viz

Philosophy of R stats Statistical thinking: likelihood, error, and effect sizes. Contemporary statisticians embrace and are mindful of these three key concepts in every examination of research, data, and statistical inference.

Modes of inference with your data: using data, what can you infer or do?
1. Data description
2. Likelihood
3. Estimation
4. Bayesian inference - weight and include what we know
5. Prediction
6. Hypothesis testing
7. Decision making - balance gains and risks

This set of ideas provides the foundation for many data science and statistical approaches to working with evidence in almost every domain of science.
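A minimal base-R sketch of a few of these modes applied to one sample. The data vector here is invented purely for illustration:

```r
# Toy data: hypothetical measurements (all values invented for illustration)
x <- c(4.1, 5.0, 3.8, 4.6, 5.2, 4.4, 4.9, 4.3)

# 1. Data description: summarize what was observed
mean(x)
sd(x)

# 3. Estimation: a 95% confidence interval for the mean
t.test(x)$conf.int

# 6. Hypothesis testing: is the mean different from 4?
t.test(x, mu = 4)$p.value
```

The same vector supports several modes of inference; which one you use depends on the question, not the data.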

Data viz first and foremost. Case study #1.

#blog post by Fisseha Berhane
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(reshape2)

#Create four groups: setA, setB, setC and setD.
setA=select(anscombe, x=x1,y=y1)
setB=select(anscombe, x=x2,y=y2)
setC=select(anscombe, x=x3,y=y3)
setD=select(anscombe, x=x4,y=y4)

#Add a third column which can help us to identify the four groups.
setA$group ='SetA'
setB$group ='SetB'
setC$group ='SetC'
setD$group ='SetD'

#merge the four datasets
all_data=rbind(setA,setB,setC,setD)
all_data[c(1,13,23,43),]  # showing sample
##     x    y group
## 1  10 8.04  SetA
## 13  8 8.14  SetB
## 23 10 7.46  SetC
## 43  8 7.91  SetD
#compare summary stats
summary_stats =all_data%>%group_by(group)%>%summarize("mean x"=mean(x),
                                       "Sample variance x"=var(x),
                                       "mean y"=round(mean(y),2),
                                       "Sample variance y"=round(var(y),1),
                                       'Correlation between x and y '=round(cor(x,y),2)
                                      )
models = all_data %>% 
      group_by(group) %>%
      do(mod = lm(y ~ x, data = .)) %>%
      do(data.frame(var = names(coef(.$mod)),
                    coef = round(coef(.$mod),2),
                    group = .$group)) %>%
dcast(., group~var, value.var = "coef")
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
summary_stats_and_linear_fit = cbind(summary_stats, data_frame("Linear regression" =
                                    paste0("y = ",models$"(Intercept)"," + ",models$x,"x")))

summary_stats_and_linear_fit
##   group mean x Sample variance x mean y Sample variance y
## 1  SetA      9                11    7.5               4.1
## 2  SetB      9                11    7.5               4.1
## 3  SetC      9                11    7.5               4.1
## 4  SetD      9                11    7.5               4.1
##   Correlation between x and y  Linear regression
## 1                         0.82      y = 3 + 0.5x
## 2                         0.82      y = 3 + 0.5x
## 3                         0.82      y = 3 + 0.5x
## 4                         0.82      y = 3 + 0.5x
#data viz instead as first step
ggplot(all_data, aes(x=x,y=y)) +geom_point(shape = 21, colour = "red", fill = "orange", size = 3)+
    ggtitle("Anscombe's data sets")+geom_smooth(method = "lm",se = FALSE,color='blue') + 
    facet_wrap(~group, scales="free")

Outcome: with stats first and data viz later (you were tricked), descriptive estimates of data can be deceptive. Draw first, then interpret.

Survey data from class. Case study #2.

#load class survey responses from google poll completed in class
survey<-read.csv("data/5081.survey.1.csv")
str(survey) #check data match what we collected
## 'data.frame':    18 obs. of  4 variables:
##  $ r.experience : int  2 1 2 3 1 1 3 1 1 3 ...
##  $ discipline   : Factor w/ 4 levels "ecology","environmental science",..: 4 3 4 4 1 4 1 1 1 3 ...
##  $ research.data: Factor w/ 2 levels "qualitative",..: 2 2 2 2 2 2 2 1 2 2 ...
##  $ r.studio     : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 2 1 2 1 ...
#data viz
hist(survey$r.experience, xlab="experience in R (1 is none, 5 is extensive)", ylab="frequency", main="Likert Scale 1 to 5")

plot(survey$r.experience~survey$discipline, xlab="discipline", ylab="experience in R")

plot(survey$r.studio, ylab="R Studio")

plot(survey$research.data, ylab="Research data")

#observe patterns by checking plots

Observations from data viz We have limited experience in R. Experience in R varies by research discipline. Half of the respondents have tried R Studio. Most participants will be working with quantitative data in their own research projects.

#Now, try some simple summary statistics.
summary(survey)
##   r.experience                   discipline      research.data r.studio
##  Min.   :1.000   ecology              :6    qualitative : 1    No :9   
##  1st Qu.:1.000   environmental science:1    quantitative:17    Yes:9   
##  Median :1.000   genetics             :2                               
##  Mean   :1.667   physiology           :9                               
##  3rd Qu.:2.000                                                         
##  Max.   :3.000
#Data summary looks reasonable based on plots, mean R experience is < 2
t.test(survey$r.experience, mu=1) #t-test if mean is different from 1
## 
##  One Sample t-test
## 
## data:  survey$r.experience
## t = 3.3665, df = 17, p-value = 0.003664
## alternative hypothesis: true mean is not equal to 1
## 95 percent confidence interval:
##  1.248861 2.084472
## sample estimates:
## mean of x 
##  1.666667
t.test(survey$r.experience, mu=2) #t-test if mean is different from 2
## 
##  One Sample t-test
## 
## data:  survey$r.experience
## t = -1.6833, df = 17, p-value = 0.1106
## alternative hypothesis: true mean is not equal to 2
## 95 percent confidence interval:
##  1.248861 2.084472
## sample estimates:
## mean of x 
##  1.666667
#A one-sample t-test confirms we have a bit of experience in R.

m1<-glm(r.experience~discipline, family = poisson, data = survey) #test for differences between disciplines in R experience
m1 #model summary
## 
## Call:  glm(formula = r.experience ~ discipline, family = poisson, data = survey)
## 
## Coefficients:
##                     (Intercept)  disciplineenvironmental science  
##                       5.108e-01                       -5.108e-01  
##              disciplinegenetics             disciplinephysiology  
##                       1.823e-01                       -6.678e-10  
## 
## Degrees of Freedom: 17 Total (i.e. Null);  14 Residual
## Null Deviance:       6.808 
## Residual Deviance: 6.371     AIC: 56.79
anova(m1, test="Chisq") #test whether R experience differs among disciplines
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: r.experience
## 
## Terms added sequentially (first to last)
## 
## 
##            Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL                          17     6.8075         
## discipline  3  0.43692        14     6.3706   0.9325
#Too little evidence that R experience differs between disciplines.

Practical outcomes of R stats useful for competency test Understand the difference between R and R Studio. Use scripts or R Markdown files to save all your work. Be prepared to share your code. Load data, clean data, visualize data, then and only then begin applying statistics. Proximately: be able to use and work with dataframes, vectors, functions, and libraries in R.
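A minimal sketch of those proximate building blocks, with toy values invented for illustration:

```r
# a vector
experience <- c(2, 1, 3, 1, 2)

# a dataframe
students <- data.frame(id = 1:5, r.experience = experience)

# a function: rescale a numeric vector to the 0-1 range
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
students$scaled <- rescale01(students$r.experience)

# a library: load one explicitly (stats happens to be attached by default)
library(stats)
mean(students$scaled)
```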

Lesson 2. Workflows & Data Wrangling (WDW).

workflows reproduce. openly.

data wrangling more than half the battle.

PK talk here :)

Philosophy of R stats Tidy data make your life easier. Data structures should match intuition and common sense. Data should have logical structure: rows are observations, columns are variables. Tidy data also increase the likelihood that others can use your data, do better science, reuse science, and help you and your ideas survive and thrive. A workflow should also include the wrangling you did to get your data ready. If your data are already very clean in a spreadsheet and easily become a literate, logical dataframe, you should still use annotation within the introductory code to explain the metadata of your data to some extent and what you did pre-R to get it tidy. The philosophy here is very similar to the forthcoming data viz lesson, with two dominant paradigms: base R code and functions, versus pipes (%>%) and the logic embodied within the libraries of the tidyverse. Generally, I prefer the tidyverse because it is more logical and much easier to remember. It also has some specific functions that address common errors in modern data structures with little code needed to fix them up.
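To make the two paradigms concrete, here is a minimal sketch of the same grouped summary done both ways (built-in mtcars data; the column choices are arbitrary):

```r
library(dplyr)

# Base R: mean mpg per cylinder count
base_out <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# tidyverse: the same summary expressed as a pipe of verbs
tidy_out <- mtcars %>%
  group_by(cyl) %>%
  summarize(mpg = mean(mpg))
```

Both return identical means; the pipe reads top to bottom as a sequence of verbs, which is why many people find it easier to remember.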

Workflow Suggested example

Data wrangling

Base R key concepts: aggregate, tapply, sapply, lapply, subsetting, as.factor, is.numeric, NA handling
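A quick sketch of those base R concepts, using the built-in iris data:

```r
# summaries by group
tapply(iris$Sepal.Length, iris$Species, mean)
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# apply over columns: sapply simplifies, lapply returns a list
sapply(iris[, 1:4], mean)
lapply(iris[, 1:4], range)

# subsetting and type checks
setosa <- iris[iris$Species == "setosa", ]
is.numeric(iris$Sepal.Length)
as.factor(c("a", "b", "a"))

# NA handling
mean(c(1, NA, 3), na.rm = TRUE)
```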

tidyverse pipes

Cheatsheet

tidyr
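As a taste of tidyr, a minimal sketch with invented data: gather() reshapes a wide table into tidy long form, one row per observation.

```r
library(tidyr)

# Hypothetical wide data: one column per measurement week
wide <- data.frame(plant = c("p1", "p2"),
                   week1 = c(5.1, 4.8),
                   week2 = c(7.3, 6.9))

# Tidy (long) form: each row is one observation
long <- gather(wide, key = "week", value = "height", week1, week2)
```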

Practical outcomes of R stats useful for competency test

Lesson 3. Visualization in R (VR).

basic plots. lattice. ggplot2. you need to see data. see the trends. explore them using visuals.

Contemporary data viz for statistical analyses slide deck

Philosophy of R stats Clean, simple graphics are powerful tools in statistics (and in scientific communication). Tufte and others have shaped data scientists and statisticians in developing more libraries, new standards, and assumptions associated with graphical representations of data. Data viz must highlight the differences, show underlying data structures, and provide insights into the specific research project. R is infinitely customizable in all these respects. There are at least two major current paradigms (there are more; these are the two dominant idea sets). Base R plots are simple, relatively flexible, and very easy. However, their grammar, i.e. their rules of coding, is not modern. ggplot2 and related libraries invoke a new, formal grammar of graphics that is more logical and more flexible, but divergent from base R code. It is worth the time to understand the differences and know when to use each.

Plotting in R evolved from base-R graphics to lattice and on to the grammar-of-graphics family (including ggvis), with ggplot2 the dominant library. Base-R is certainly useful in some contexts, as are the lattice and latticeExtra libraries. However, ggplot2 now encompasses all these capacities with a much simpler grammar (i.e. rules and order). Nonetheless, you should be able to read base-R code for plots and be able to write some as well. The philosophy or grammar of modern graphics is well articulated and includes the following key principles.

The grammar of graphics:
- primacy of layers (simple first, then more complex), i.e. you build up your plots
- data are mapped to aesthetic attributes and geometric objects
- data first, then statistics, even in plots
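Those principles can be seen directly in code; a minimal sketch building one plot layer by layer with the built-in mtcars data:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg))  # data mapped to aesthetics
p <- p + geom_point()                      # a geometric object layer
p <- p + geom_smooth(method = "lm")        # a statistical layer added last
p
```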

Disclaimer: I love the power of qplots.

Data viz case study #1.

library(ggplot2)
survey<-read.csv("data/5081.survey.1.csv")
str(survey)
## 'data.frame':    18 obs. of  4 variables:
##  $ r.experience : int  2 1 2 3 1 1 3 1 1 3 ...
##  $ discipline   : Factor w/ 4 levels "ecology","environmental science",..: 4 3 4 4 1 4 1 1 1 3 ...
##  $ research.data: Factor w/ 2 levels "qualitative",..: 2 2 2 2 2 2 2 1 2 2 ...
##  $ r.studio     : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 2 1 2 1 ...
plot(survey$r.experience) #hard to tell what is going on

qplot(r.experience, data=survey) #qplot decided to make bins for me and count up
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#so, we know better and instead do a histogram using base graphics
#basic data viz for EDA
hist(survey$r.experience) #better

qplot(r.experience, data=survey, geom="histogram") #same as what is picked for us
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(r.experience, data=survey, geom="histogram", binwidth=0.5)

barplot(survey$r.experience) #confusing

qplot(r.experience, data=survey, geom="bar") #what, it is back!

#basic data viz for EDA but for interactions
plot(survey$discipline, survey$r.experience)

qplot(discipline, r.experience, data=survey) #not the same

qplot(discipline, r.experience, data=survey, geom="boxplot")

plot(survey$r.studio~survey$r.experience) #ugly

qplot(r.experience, r.studio, data=survey) #useless

qplot(r.studio, data=survey, weight = r.experience) #sweet new feature here

#ok, so you get it. grammar different, visuals about the same for super quick, simple plots. The grammar hints at the power that awaits though.

#grammar different, simple x or x~y plots about the same

Data viz case study #2.

str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
#crazy number of observations. We need less. too many riches not always good.
set.seed(1000)
dsmall<-diamonds[sample(nrow(diamonds), 100), ]

plot(dsmall$carat, dsmall$price)

qplot(carat, price, data=dsmall)

#ok no difference
#now let's see what we can do with qplot with a few bits added
#one little argument extra added
qplot(carat, price, data = dsmall, colour = color)

qplot(carat, price, data = dsmall, shape = cut)

#how about using data viz to even more thoroughly explore potential stats we could do.
#qplots - quick plot, thoughtful build of layers
qplot(carat, price, data = dsmall, geom = c("point", "smooth"))

#what about trying some stats on this now, at least from a viz philosophy
qplot(color, price / carat, data = dsmall, geom = "boxplot") #can include formulas and methods

#or check for proportional differences
qplot(carat, data = dsmall, geom = "histogram", fill = color) #to see proportions
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(carat, data = dsmall, geom = "histogram", weight = price) # weight by a covariate
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#final idea, how about subsetting with the data right in the code for the plots!
qplot(carat, data = diamonds, facets = color ~ .,
  geom = "histogram", binwidth = 0.1, xlim = c(0, 3)) #to compare between groups
## Warning: Removed 32 rows containing non-finite values (stat_bin).

#qplot is so powerful.
#colour, shape and size have meaning in the R code from this library
#layers added for you by qplots

#qplot gets you 2x and one y or one x and 2y so >2 variables at once easily

Data viz case study #3.

#GGPLOT() gives you even more options for adding layers/options
p <- ggplot(mtcars, aes(x = mpg, y = wt))
p + geom_point()

#now play time with this case study.
#try out some geom options and different aesthetics and make some errors.
#prize for the prettiest plots

#displ is car engine size in Liters
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

#so aethetics are one way to add variables in and expand your plotting power
#however facets are another way to make multiple plots BY a factor

#facet wrap is by one variable
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

#facet_wrap(~cell) - univariate: create a 1-d strip of panels, based on one factor, and wrap the strip into a 2-d matrix
#facet_grid(row~col) - (usually) bivariate: create a 2-d matrix of panels, based on two factors

#facet grid is by two variables
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)

#another example, cleaner code
p <- ggplot(data = mpg, aes(x = displ, y = hwy, color = drv)) + geom_point()
p + facet_wrap(~cyl)

#or just use facets in qplot but much simpler
qplot(displ, hwy, data=mpg, facets = . ~ year) + geom_smooth()

Data viz case study #4.

#try it with ecological.footprints.csv

Practical outcomes of R stats useful for competency test Do a meaningful plot of a single variable with nice labels. Do a meaningful plot of a relationship between two variables. Use qplot to do a plot that includes an aesthetic. Do a ggplot that includes three variables.
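One possible sketch of those four tasks, using the built-in mpg data (the variable choices are just examples):

```r
library(ggplot2)

# 1. One variable, with nice labels
qplot(hwy, data = mpg, binwidth = 2,
      xlab = "Highway mileage (mpg)", ylab = "count")

# 2. A relationship between two variables
qplot(displ, hwy, data = mpg)

# 3. qplot with an aesthetic
qplot(displ, hwy, data = mpg, colour = class)

# 4. ggplot with three variables
ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) + geom_point()
```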

Lesson 4. Exploratory data analysis (EDA) in R.

fundamental descriptive stats. GLM. GLMM. Post hoc tests.
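As a preview, a minimal sketch of a GLM with a follow-up post hoc test, using the built-in warpbreaks data (GLMMs, e.g. via lme4, additionally need a random grouping structure not shown here):

```r
# GLM: count response modeled with a Poisson error structure
m <- glm(breaks ~ tension, family = poisson, data = warpbreaks)
anova(m, test = "Chisq")  # deviance test for the tension effect

# Post hoc pairwise comparisons via an ANOVA fit and Tukey's HSD
a <- aov(breaks ~ tension, data = warpbreaks)
TukeyHSD(a)
```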

Lesson 5. Wrangle, visualize, and analyze.

A graduate-level dataset. Apply your new mathemagical skills from scratch. A single three-hour block. As advanced as you want to be. Fundamental principles demonstrated. At least two fundamental statistical tests demonstrated.

Lesson 6. Competency test.

deliberate practice. tested with feedback. a virtual repeat of last week, but evaluated.